Disconnected operation is a mode of operation that enables a client to continue accessing critical
data during temporary failures of a shared data repository. An important, though not exclusive,
application of disconnected operation is in supporting portable computers. In this paper, we show
that disconnected operation is feasible, efficient and usable by describing its design and
implementation in the Coda File System. The central idea behind our work is that caching of data,
now widely used for performance, can also be exploited to improve availability.

Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems
Management-distributed file systems; D.4.5 [Operating Systems]: Reliability-fault tolerance;
D.4.8 [Operating Systems]: Performance-measurements

General Terms: Design, Experimentation, Measurement, Performance, Reliability

Additional Key Words and Phrases: Disconnected operation, hoarding, optimistic replication, 
reintegration, second-class replication, server emulation 



1. INTRODUCTION 

Every serious user of a distributed system has faced situations where critical 
work has been impeded by a remote failure. His frustration is particularly 
acute when his workstation is powerful enough to be used standalone, but 
has been configured to be dependent on remote resources. An important 
instance of such dependence is the use of data from a distributed file system. 

Placing data in a distributed file system simplifies collaboration between 
users, and allows them to delegate the administration of that data. The 
growing popularity of distributed file systems such as NFS [16] and AFS [19]
attests to the compelling nature of these considerations. Unfortunately, the
users of these systems have to accept the fact that a remote failure at a 
critical juncture may seriously inconvenience them. 

How can we improve this state of affairs? Ideally, we would like to enjoy 
the benefits of a shared data repository, but be able to continue critical work 
when that repository is inaccessible. We call the latter mode of operation 
disconnected operation, because it represents a temporary deviation from 
normal operation as a client of a shared repository. 

In this paper we show that disconnected operation in a file system is indeed 
feasible, efficient and usable. The central idea behind our work is that 
caching of data, now widely used to improve performance, can also be 
exploited to enhance availability. We have implemented disconnected opera- 
tion in the Coda File System at Carnegie Mellon University. 

Our initial experience with Coda confirms the viability of disconnected 
operation. We have successfully operated disconnected for periods lasting one 
to two days. For a disconnection of this duration, the process of reconnecting 
and propagating changes typically takes about a minute. A local disk of 
100MB has been adequate for us during these periods of disconnection. 
Trace-driven simulations indicate that a disk of about half that size should be 
adequate for disconnections lasting a typical workday. 

2. DESIGN OVERVIEW 

Coda is designed for an environment consisting of a large collection of 
untrusted Unix 1 clients and a much smaller number of trusted Unix file 
servers. The design is optimized for the access and sharing patterns typical of 
academic and research environments. It is specifically not intended for 
applications that exhibit highly concurrent, fine granularity data access. 

Each Coda client has a local disk and can communicate with the servers 
over a high bandwidth network. At certain times, a client may be temporar- 
ily unable to communicate with some or all of the servers. This may be due to 
a server or network failure, or due to the detachment of a portable client 
from the network. 

Clients view Coda as a single, location-transparent shared Unix file system.
The Coda namespace is mapped to individual file servers at the granularity
of subtrees called volumes. At each client, a cache manager (Venus)
dynamically obtains and caches volume mappings.

Coda uses two distinct, but complementary, mechanisms to achieve high 
availability. The first mechanism, server replication, allows volumes to have 
read-write replicas at more than one server. The set of replication sites for a 
volume is its volume storage group (VSG). The subset of a VSG that is 
currently accessible is a client's accessible VSG (AVSG). The performance
cost of server replication is kept low by caching on disks at clients and 
through the use of parallel access protocols. Venus uses a cache coherence 
protocol based on callbacks [9] to guarantee that an open file yields its latest
copy in the AVSG. This guarantee is provided by servers notifying clients
when their cached copies are no longer valid, each notification being referred
to as a 'callback break'. Modifications in Coda are propagated in parallel to
all AVSG sites, and eventually to missing VSG sites.

1 Unix is a trademark of AT&T Bell Telephone Labs.

Disconnected operation, the second high availability mechanism used by 
Coda, takes effect when the AVSG becomes empty. While disconnected, 
Venus services file system requests by relying solely on the contents of its 
cache. Since cache misses cannot be serviced or masked, they appear 
as failures to application programs and users. When disconnection ends, 
Venus propagates modifications and reverts to server replication. Figure 1 
depicts a typical scenario involving transitions between server replication 
and disconnected operation. 

Earlier Coda papers [18, 19] have described server replication in depth. In 
contrast, this paper restricts its attention to disconnected operation. We 
discuss server replication only in those areas where its presence has signifi- 
cantly influenced our design for disconnected operation. 

3. DESIGN RATIONALE 

At a high level, two factors influenced our strategy for high availability. 
First, we wanted to use conventional, off-the-shelf hardware throughout 
our system. Second, we wished to preserve transparency by seamlessly inte- 
grating the high availability mechanisms of Coda into a normal Unix 
environment. 

At a more detailed level, other considerations influenced our design. These 
include the need to scale gracefully, the advent of portable workstations, the 
very different resource, integrity, and security assumptions made about 
clients and servers, and the need to strike a balance between availability and 
consistency. We examine each of these issues in the following sections. 

3.1 Scalability 

Successful distributed systems tend to grow in size. Our experience with 
Coda's ancestor, AFS, had impressed upon us the need to prepare for growth 
a priori, rather than treating it as an afterthought [17]. We brought this 
experience to bear upon Coda in two ways. First, we adopted certain mecha- 
nisms that enhance scalability. Second, we drew upon a set of general 
principles to guide our design choices. 

An example of a mechanism we adopted for scalability is callback-based 
cache coherence. Another such mechanism, whole-file caching, offers the
added advantage of a much simpler failure model: a cache miss can only 
occur on an open, never on a read, write, seek, or close. This, in turn, 
substantially simplifies the implementation of disconnected operation. A 
partial-file caching scheme such as that of AFS-4 [22], Echo [8] or MFS 
[1] would have complicated our implementation and made disconnected 
operation less transparent. 

A scalability principle that has had considerable influence on our design is 
the placing of functionality on clients rather than servers. Only if integrity or
security would have been compromised have we violated this principle.
Another scalability principle we have adopted is the avoidance of system-wide 
rapid change. Consequently, we have rejected strategies that require election 
or agreement by large numbers of nodes. For example, we have avoided 
algorithms such as that used in Locus [23] that depend on nodes achieving
consensus on the current partition state of the network. 

3.2 Portable Workstations 

Powerful, lightweight and compact laptop computers are commonplace today. 
It is instructive to observe how a person with data in a shared file system 
uses such a machine. Typically, he identifies files of interest and downloads 
them from the shared file system into the local name space for use while 
isolated. When he returns, he copies modified files back into the shared file 
system. Such a user is effectively performing manual caching, with write-back 
upon reconnection! 

Early in the design of Coda we realized that disconnected operation could 
substantially simplify the use of portable clients. Users would not have to 
use a different name space while isolated, nor would they have to man- 
ually propagate changes upon reconnection. Thus portable machines are a 
champion application for disconnected operation. 

The use of portable machines also gave us another insight. The fact that 
people are able to operate for extended periods in isolation indicates that they 
are quite good at predicting their future file access needs. This, in turn, 
suggests that it is reasonable to seek user assistance in augmenting the 
cache management policy for disconnected operation. 

Functionally, involuntary disconnections caused by failures are no different 
from voluntary disconnections caused by unplugging portable computers. 
Hence Coda provides a single mechanism to cope with all disconnections. Of 
course, there may be qualitative differences: user expectations as well as the 
extent of user cooperation are likely to be different in the two cases. 

3.3 First- vs. Second-Class Replication 

If disconnected operation is feasible, why is server replication needed at 
all? The answer to this question depends critically on the very different 
assumptions made about clients and servers in Coda. 

Clients are like appliances: they can be turned off at will and may be 
unattended for long periods of time. They have limited disk storage capacity, 
their software and hardware may be tampered with, and their owners may 
not be diligent about backing up the local disks. Servers are like public 
utilities: they have much greater disk capacity, they are physically secure, 
and they are carefully monitored and administered by professional staff. 

It is therefore appropriate to distinguish between first-class replicas on 
servers, and second-class replicas (i.e., cache copies) on clients. First-class 
replicas are of higher quality: they are more persistent, widely known, 
secure, available, complete and accurate. Second-class replicas, in contrast, 
are inferior along all these dimensions. Only by periodic revalidation with 
respect to a first-class replica can a second-class replica be useful.

[...] known. Thus, an entire user community could be at the mercy of a single
errant client for an unbounded amount of time. 

Placing a time bound on exclusive or shared control, as done in the case of 
leases [7], avoids this problem but introduces others. Once a lease expires, a 
disconnected client loses the ability to access a cached object, even if no one 
else in the system is interested in it. This, in turn, defeats the purpose of 
disconnected operation which is to provide high availability. Worse, updates 
already made while disconnected have to be discarded. 

An optimistic approach has its own disadvantages. An update made at one 
disconnected client may conflict with an update at another disconnected or 
connected client. For optimistic replication to be viable, the system has to be 
more sophisticated. There needs to be machinery in the system for detecting 
conflicts, for automating resolution when possible, and for confining dam- 
age and preserving evidence for manual repair. Having to repair conflicts 
manually violates transparency, is an annoyance to users, and reduces the 
usability of the system. 

We chose optimistic replication because we felt that its strengths and 
weaknesses better matched our design goals. The dominant influence on our 
choice was the low degree of write-sharing typical of Unix. This implied that 
an optimistic strategy was likely to lead to relatively few conflicts. An 
optimistic strategy was also consistent with our overall goal of providing the 
highest possible availability of data. 

In principle, we could have chosen a pessimistic strategy for server replica- 
tion even after choosing an optimistic strategy for disconnected operation. 
But that would have reduced transparency, because a user would have faced 
the anomaly of being able to update data when disconnected, but being 
unable to do so when connected to a subset of the servers. Further, many of 
the previous arguments in favor of an optimistic strategy also apply to server 
replication. 

Using an optimistic strategy throughout presents a uniform model of the 
system from the user's perspective. At any time, he is able to read the latest 
data in his accessible universe and his updates are immediately visible to 
everyone else in that universe. His accessible universe is usually the entire 
set of servers and clients. When failures occur, his accessible universe 
shrinks to the set of servers he can contact, and the set of clients that they, in 
turn, can contact. In the limit, when he is operating disconnected, his 
accessible universe consists of just his machine. Upon reconnection, his 
updates become visible throughout his now-enlarged accessible universe. 

4. DETAILED DESIGN AND IMPLEMENTATION 

In describing our implementation of disconnected operation, we focus on the 
client since this is where much of the complexity lies. Section 4.1 describes 
the physical structure of a client, Section 4.2 introduces the major states 
of Venus, and Sections 4.3 to 4.5 discuss these states in detail. A descrip- 
tion of the server support needed for disconnected operation is contained in 
Section 4.5. 
Fig. 2. Structure of a Coda client.



4.1 Client Structure 
4.2 Venus States 
Fig. 3. Venus states and transitions. When disconnected, Venus is in the emulation state. It
transits to reintegration upon successful reconnection to an AVSG member, and thence to
hoarding, where it resumes connected operation.



cache with its AVSG, and then reverts to the hoarding state. Since
all volumes may not be replicated across the same set of servers, Venus can
be in different states with respect to different volumes, depending on failure
conditions in the system.

4.3 Hoarding 

The hoarding state is so named because a key responsibility of Venus in this 
state is to hoard useful data in anticipation of disconnection. However, this is 
not its only responsibility. Rather, Venus must manage its cache in a manner 
that balances the needs of connected and disconnected operation. For instance,
a user may have indicated that a certain set of files is critical but may
currently be using other files. To provide good performance, Venus must 
cache the latter files. But to be prepared for disconnection, it must also cache 
the former set of files. 

Many factors complicate the implementation of hoarding:

- File reference behavior, especially in the distant future, cannot be
  predicted with certainty.
- Disconnections and reconnections are often unpredictable.
- The true cost of a cache miss while disconnected is highly variable and
  hard to quantify.
- Activity at other clients must be accounted for, so that the latest version of
  an object is in the cache at disconnection.
- Since cache space is finite, the availability of less critical objects may have
  to be sacrificed in favor of more critical objects.

    # Personal files
    a /coda/usr/jjk d+
    a /coda/usr/jjk/papers 100:d+
    a /coda/usr/jjk/papers/sosp 1000:d+

    # System files
    a /usr/bin 100:d+
    a /usr/etc 100:d+
    a /usr/include 100:d+
    a /usr/lib 100:d+
    a /usr/local/gnu d+
    a /usr/local/rcs d+
    a /usr/ucb d+

    (a)

    # X11 files
    # (from X11 maintainer)
    a /usr/X11/bin/X
    a /usr/X11/bin/Xvga
    a /usr/X11/bin/mwm
    a /usr/X11/bin/startx
    a /usr/X11/bin/xclock
    a /usr/X11/bin/xinit
    a /usr/X11/bin/xterm
    a /usr/X11/include/X11/bitmaps c+
    a /usr/X11/lib/app-defaults d+
    a /usr/X11/lib/fonts/misc c+
    a /usr/X11/lib/system.mwmrc

    (b)

    # Venus source files
    # (shared among Coda developers)
    a /coda/project/coda/src/venus 100:c+
    a /coda/project/coda/include 100:c+
    a /coda/project/coda/lib c+

    (c)

Fig. 4. Sample hoard profiles. These are typical hoard profiles provided by a Coda user, an
application maintainer, and a group of project developers. Each profile is interpreted separately
by the HDB front-end program. The 'a' at the beginning of a line indicates an add-entry
command. Other commands are delete an entry, clear all entries, and list entries. The modifiers
following some pathnames specify nondefault priorities (the default is 10) and/or meta-expansion
for the entry. Note that the pathnames beginning with '/usr' are actually symbolic links into
'/coda'.



To address these concerns, we manage the cache using a prioritized 
algorithm, and periodically reevaluate which objects merit retention in the 
cache via a process known as hoard walking. 

4.3.1 Prioritized Cache Management. Venus combines implicit and ex- 
plicit sources of information in its priority-based cache management algo- 
rithm. The implicit information consists of recent reference history, as in 
traditional caching algorithms. Explicit information takes the form of a 
per-workstation hoard database (HDB), whose entries are pathnames identi- 
fying objects of interest to the user at the workstation. 

A simple front-end program allows a user to update the HDB using 
command scripts called hoard profiles, such as those shown in Figure 4. 
Since hoard profiles are just files, it is simple for an application maintainer to 
provide a common profile for his users, or for users collaborating on a project 
to maintain a common profile. A user can customize his HDB by specifying 
different combinations of profiles or by executing front-end commands inter- 
actively. To facilitate construction of hoard profiles, Venus can record all file 
references observed between a pair of start and stop events indicated by 
a user. 

To reduce the verbosity of hoard profiles and the effort needed to maintain
them, Venus supports meta-expansion of HDB entries. As shown in Figure 4,
if the letter 'c' (or 'd') follows a pathname, the command also applies to
immediate children (or all descendants). A '+' following the 'c' or 'd' indicates
that the command applies to all future as well as present children or
descendants. A hoard entry may optionally indicate a hoard priority, with
higher priorities indicating more critical objects.
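
The entry syntax just described can be illustrated with a small parser. This is a minimal sketch and not the actual HDB front-end: the names `HoardEntry` and `parse_hoard_line` and the treatment of '#' lines as comments are assumptions made for illustration. Only the 'a' command, the optional priority, the 'c' and 'd' letters, the '+' suffix and the default priority of 10 are taken from the text and Figure 4.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PRIORITY = 10  # default hoard priority stated in the Fig. 4 caption


@dataclass
class HoardEntry:
    path: str
    priority: int = DEFAULT_PRIORITY
    expand: Optional[str] = None  # 'c' = immediate children, 'd' = all descendants
    future: bool = False          # True when a '+' follows the 'c' or 'd'


def parse_hoard_line(line: str) -> Optional[HoardEntry]:
    """Parse one add-entry line; return None for blank lines and '#' comments."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    fields = text.split()
    if fields[0] != "a" or len(fields) not in (2, 3):
        raise ValueError(f"unsupported hoard command: {line!r}")
    entry = HoardEntry(path=fields[1])
    if len(fields) == 3:
        modifier = fields[2]
        # A modifier is "PRIORITY", "META", or "PRIORITY:META".
        prio, sep, meta = modifier.rpartition(":")
        if not sep and modifier.isdigit():
            prio, meta = modifier, ""
        if prio:
            entry.priority = int(prio)
        if meta:
            if meta[0] not in "cd" or meta[1:] not in ("", "+"):
                raise ValueError(f"bad meta-expansion: {meta!r}")
            entry.expand = meta[0]
            entry.future = meta.endswith("+")
    return entry
```

For example, `parse_hoard_line("a /coda/usr/jjk/papers 100:d+")` yields an entry with priority 100 that covers all present and future descendants of the directory.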

The current priority of a cached object is a function of its hoard priority as 
well as a metric representing recent usage. The latter is updated continu- 
ously in response to new references, and serves to age the priority of objects 
no longer in the working set. Objects of the lowest priority are chosen as 
victims when cache space has to be reclaimed. 
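
One way such a priority might be computed and used for victim selection is sketched below. The text does not give the actual function, so the linear weighting `ALPHA`, the tick-based usage metric `MAX_RECENCY`, and every name in the block are illustrative assumptions rather than the Venus implementation.

```python
from dataclasses import dataclass

ALPHA = 0.5        # assumed weight given to the hoard priority
MAX_RECENCY = 100  # assumed ceiling of the recent-usage metric


@dataclass
class CachedObject:
    name: str
    hoard_priority: int  # 0 when the object is not mentioned in the HDB
    last_ref_tick: int   # logical time of the most recent reference


def current_priority(obj: CachedObject, now: int) -> float:
    """Combine hoard priority with a usage metric that decays with age."""
    age = now - obj.last_ref_tick
    recency = max(0, MAX_RECENCY - age)  # ages toward 0 when unreferenced
    return ALPHA * obj.hoard_priority + (1 - ALPHA) * recency


def choose_victim(cache: list[CachedObject], now: int) -> CachedObject:
    """Pick the lowest-priority object when cache space must be reclaimed."""
    return min(cache, key=lambda o: current_priority(o, now))
```

Under these assumptions a hoarded but idle object and a recently used but unhoarded object both outrank an object that is neither, which is the behavior the paragraph above describes.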

To resolve the pathname of a cached object while disconnected, it is 
imperative that all the ancestors of the object also be cached. Venus must 
therefore ensure that a cached directory is not purged before any of its 
descendants. This hierarchical cache management is not needed in tradi- 
tional file caching schemes because cache misses during name translation 
can be serviced, albeit at a performance cost. Venus performs hierarchical 
cache management by assigning infinite priority to directories with cached 
children. This automatically forces replacement to occur bottom-up. 
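
The effect of the infinite-priority rule can be seen in a small sketch. The functions `effective_priority` and `eviction_order`, and the representation of the cache as a mapping from pathname to base priority, are assumptions for illustration only.

```python
import math
import posixpath


def effective_priority(path: str, base: dict[str, float]) -> float:
    """Return infinity for a directory that still has a cached child."""
    has_cached_child = any(
        other != path and posixpath.dirname(other) == path for other in base
    )
    return math.inf if has_cached_child else base[path]


def eviction_order(base: dict[str, float]) -> list[str]:
    """Evict repeatedly; a directory becomes evictable once its children are gone."""
    remaining = dict(base)
    order = []
    while remaining:
        victim = min(remaining, key=lambda p: effective_priority(p, remaining))
        order.append(victim)
        del remaining[victim]
    return order
```

Even when a directory has a lower base priority than the file beneath it, the file is evicted first, so every cached object keeps a resolvable pathname.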

4.3.2 Hoard Walking. We say that a cache is in equilibrium, signifying 
that it meets user expectations about availability, when no uncached object 
has a higher priority than a cached object. Equilibrium may be disturbed as a 
result of normal activity. For example, suppose an object, A, is brought into 
the cache on demand, replacing an object, B. Further suppose that B is 
mentioned in the HDB, but A is not. Some time after activity on A ceases, 
its priority will decay below the hoard priority of B. The cache is no longer in 
equilibrium, since the cached object A has lower priority than the uncached 
object B. 

Venus periodically restores equilibrium by performing an operation known 
as a hoard walk. A hoard walk occurs every 10 minutes in our current 
implementation, but one may be explicitly required by a user prior to 
voluntary disconnection. The walk occurs in two phases. First, the name 
bindings of HDB entries are reevaluated to reflect update activity by other 
Coda clients. For example, new children may have been created in a directory
whose pathname is specified with the '+' option in the HDB. Second, the
priorities of all entries in the cache and HDB are reevaluated, and objects 
fetched or evicted as needed to restore equilibrium. 
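
The equilibrium condition and the second phase of a hoard walk can be sketched as follows. This is a simplification under stated assumptions: every object occupies one cache slot, `priority` already reflects the name bindings refreshed in the first phase and contains every cached object, and the function names are hypothetical.

```python
def in_equilibrium(cached: set[str], priority: dict[str, float]) -> bool:
    """True when no uncached object has a higher priority than a cached one."""
    uncached = [p for p in priority if p not in cached]
    if not cached or not uncached:
        return True
    return max(priority[p] for p in uncached) <= min(priority[p] for p in cached)


def hoard_walk(cached: set[str], priority: dict[str, float], capacity: int):
    """Return (objects to fetch, objects to evict) that restore equilibrium."""
    # Rank by priority; on ties prefer objects that are already cached.
    ranked = sorted(priority, key=lambda p: (priority[p], p in cached), reverse=True)
    wanted = set(ranked[:capacity])
    return wanted - cached, cached - wanted
```

With the example from the text, where the priority of the cached object A has decayed below the hoard priority of the uncached object B, the walk fetches B and evicts A.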

Hoard walks also address a problem arising from callback breaks. In 
traditional callback-based caching, data is refetched only on demand after a 
callback break. But in Coda, such a strategy may result in a critical object 
being unavailable should a disconnection occur before the next reference to 
it. Refetching immediately upon callback break avoids this problem, but 
ignores a key characteristic of Unix environments: once an object is modified, 
it is likely to be modified many more times by the same user within a short 
interval [15, 6]. An immediate refetch policy would increase client-server 
traffic considerably, thereby reducing scalability. 

Our strategy is a compromise that balances availability, consistency and 
scalability. For files and symbolic links, Venus purges the object on callback 

[...] modified files. [...] A second possibility is to allow users to selectively
back out updates made while disconnected. A third approach is to allow
portions of the file cache and RVM to be written out to removable media
such as floppy disks.

4.5 Reintegration

[...] Venus passes in changing [...]

4.5.1 Replay Algorithm. The propagation of changes from client to AVSG
is accomplished in two steps. In the first step, Venus obtains permanent fids
for new objects and uses them to replace temporary fids in the replay log.
This step is avoided in many cases, since Venus obtains a small supply of
permanent fids in advance of need, while in the hoarding state. In the second
step, the replay log is shipped in parallel to the AVSG, and executed
independently at each member. Each server performs the replay within a
single transaction, which is aborted if any error is detected.

The replay algorithm consists of four phases. In phase one the log is
parsed, a transaction is begun, and all objects referenced in the log are
locked. In phase two, each operation in the log is validated and then
executed. The validation consists of conflict detection as well as integrity,
protection, and disk space checks. Except in the case of store operations,
execution during replay is identical to execution in connected mode. For a
store, an empty shadow file is created [...]

[...] of cached objects referenced by the log. If reintegration fails, Venus writes
out the replay log to a local replay file in a superset of the Unix tar format.
The log and all corresponding cache entries are then purged, so that subsequent
references will cause refetch of the current contents at the AVSG. A
tool is provided which allows the user to inspect the contents of a replay file,
compare it to the state at the AVSG, and replay it selectively or in its
entirety.
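
The overall shape of a server-side replay can be illustrated with a toy model. This is a sketch under heavy assumptions and not the Coda server code: the log format of `(operation, name)` tuples, the in-memory dictionary standing in for server state, the `fetch_data` callback standing in for back-fetching, and the existence check standing in for validation are all hypothetical. It is meant only to show the all-or-nothing transaction and the deferral of file data until after the log has been executed.

```python
class ReplayAbort(Exception):
    """Raised when validation fails; the whole transaction is discarded."""


def replay(log, server_state, fetch_data):
    """Replay a client log against server state, all-or-nothing.

    log:          list of (op, name) tuples, e.g. ("store", "f") or ("remove", "g")
    server_state: dict mapping name -> file contents (bytes)
    fetch_data:   callable(name) -> bytes, standing in for back-fetching
    """
    # Phase one: parse the log and lock every referenced object.
    locked = {name for _op, name in log}
    working = dict(server_state)  # the transaction works on a private copy
    deferred = []

    # Phase two: validate, then execute, each operation.
    for op, name in log:
        if op == "store":
            working[name] = b""   # empty placeholder; the data comes later
            deferred.append(name)
        elif op == "remove":
            if name not in working:
                raise ReplayAbort(f"cannot remove missing object {name!r}")
            del working[name]
        else:
            raise ReplayAbort(f"unknown operation {op!r}")

    # Phase three: back-fetch the file data for the deferred stores.
    for name in deferred:
        if name in working:       # skip files removed later in the log
            working[name] = fetch_data(name)

    # Last step (assumed): commit the private copy and release the locks.
    locked.clear()
    server_state.clear()
    server_state.update(working)
    return server_state
```

Because all changes are made to a private copy, an exception raised during validation leaves `server_state` exactly as it was, which mirrors the abort behavior of the single transaction.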
Reintegration at finer granularity than a volume would reduce the latency
perceived by clients, improve concurrency and load balancing at servers, and 
reduce user effort during manual replay. To this end, we are revising our 
implementation to reintegrate at the granularity of subsequences of depen- 
dent operations within a volume. Dependent subsequences can be identified 
using the precedence graph approach of Davidson [4]. In the revised
implementation Venus will maintain precedence graphs during emulation, and
pass them to servers along with the replay log. 

4.5.2 Conflict Handling. Our use of optimistic replica control means that 
the disconnected operations of one client may conflict with activity at servers 
or other disconnected clients. The only class of conflicts we are concerned 
with are write/write conflicts. Read/write conflicts are not relevant to the
Unix file system model, since it has no notion of atomicity beyond the 
boundary of a single system call. 

The check for conflicts relies on the fact that each replica of an object is 
tagged with a storeid that uniquely identifies the last update to it. During 
phase two of replay, a server compares the storeid of every object mentioned 
in a log entry with the storeid of its own replica of the object. If the 
comparison indicates equality for all objects, the operation is performed and 
the mutated objects are tagged with a new storeid specified in the log entry. 

If a storeid comparison fails, the action taken depends on the operation 
being validated. In the case of a store of a file, the entire reintegration is 
aborted. But for directories, a conflict is declared only if a newly created 
name collides with an existing name, if an object updated at the client or the 
server has been deleted by the other, or if directory attributes have been 
modified at the server and the client. This strategy of resolving partitioned 
directory updates is consistent with our strategy in server replication [11],
and was originally suggested by Locus [23]. 
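
The storeid check for a file store can be sketched as follows. The `Replica` record, the `validate_and_store` function and the idea that a log entry carries both the storeid the client last saw and the new storeid are assumptions made for illustration; the directory rules described above are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    storeid: str       # identifies the last update applied to this replica
    data: bytes = b""


class Conflict(Exception):
    """Signals a write/write conflict; for a file store the reintegration aborts."""


def validate_and_store(server, name, client_base_storeid, new_storeid, data):
    """Apply a logged file store only if the server replica is unchanged.

    client_base_storeid is the storeid the client held for the object
    (assumed to be carried in the log entry alongside new_storeid).
    """
    replica = server[name]
    if replica.storeid != client_base_storeid:
        # The server copy was updated independently of this client.
        raise Conflict(name)
    replica.data = data
    replica.storeid = new_storeid  # tag the mutated object with the new storeid
```

A second store that still quotes the old storeid fails the comparison, since the first store has already retagged the replica.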

Our original design for disconnected operation called for preservation of 
replay files at servers rather than clients. This approach would also allow 
damage to be confined by marking conflicting replicas inconsistent and 
forcing manual repair, as is currently done in the case of server replication. 
We are awaiting more usage experience to determine whether this is indeed 
the correct approach for disconnected operation. 

5. STATUS AND EVALUATION 

Today, Coda runs on IBM RTs, DECstation 3100s and 5000s, and 386-based
laptops such as the Toshiba 5200. A small user community has been using 
Coda on a daily basis as its primary data repository since April 1990. All 
development work on Coda is done in Coda itself. As of July 1991 there were 
nearly 350MB of triply replicated data in Coda, with plans to expand to 2GB 
in the next few months. 

A version of disconnected operation with minimal functionality was demon- 
strated in October 1990. A more complete version was functional in January 
1991, and is now in regular use. We have successfully operated disconnected 
for periods lasting one to two days. Our experience with the system has been 
[...] refinements under development [...]

In the following sections we provide qualitative and quantitative answers
to three important questions pertaining to disconnected operation:

(1) How long does reintegration take?

(2) How large a local disk does one need?

(3) How likely are conflicts?

Table I. Time for Reintegration

                   Elapsed      Reintegration Time (seconds)           Size of Replay Log  Data
                   Time                                                                    Back-Fetched
                   (seconds)    Total     AllocFid  Replay    COP2     Records   Bytes     (Bytes)

Andrew Benchmark   288 (3)      43 (2)    4 (2)     29 (1)    10 (1)   223       65,010    1,141,315
Venus Make         3,271 (28)   52 (4)    1 (0)     40 (1)    10 (3)   193       65,919    2,990,120

Note: [...] client [...] 100MB [...] 12MB [...] 400MB [...]. The values shown above are the
means of three trials. Figures in parentheses are standard deviations.

5.1 Duration of Reintegration 

In our experience, typical disconnected sessions of editing and program
development lasting a few hours required about a minute for reintegration.
[...] reintegration times after disconnected execution of two well-defined
tasks. [...] Table I presents the reintegration times for these tasks.

[...] the time to allocate [...] only if server replication is [...]

[...] because the Andrew benchmark uses the file system more intensively.
[...] of entries in the replay log is smaller. This is because much more file
data is back-fetched in the third phase of the replay. Finally, neither task
involves any think time. As a result, their reintegration times are comparable
to that after a much longer, but more typical, disconnected session in our
environment.

Fig. 5. High-water mark of cache usage. This graph is based on a total of 10 traces from 5
active Coda workstations. The curve labeled "Avg" corresponds to the values obtained by
averaging the high-water marks of all workstations. The curves labeled "Max" and "Min" plot
the highest and lowest values of the high-water marks across all workstations. Note that the
high-water mark does not include space needed for paging, the HDB or replay logs.

5.2 Cache Size 

A local disk capacity of 100MB on our clients has proved adequate for our 
initial sessions of disconnected operation. To obtain a better understanding of 
the cache size requirements for disconnected operation, we analyzed file 
reference traces from our environment. The traces were obtained by instru- 
menting workstations to record information on every file system operation, 
regardless of whether the file was in Coda, AFS, or the local file system. 

Our analysis is based on simulations driven by these traces. Writing and 
validating a simulator that precisely models the complex caching behavior of 
Venus would be quite difficult. To avoid this difficulty, we have modified 
Venus to act as its own simulator. When running as a simulator, Venus is 
driven by traces rather than requests from the kernel. Code to communicate 
with the servers, as well as code to perform physical I/O on the local file 
system are stubbed out during simulation. 

Figure 5 shows the high-water mark of cache usage as a function of time. 
The actual disk size needed for disconnected operation has to be larger, since 
both the explicit and implicit sources of hoarding information are imperfect. 
From our data it appears that a disk of 50-60MB should be adequate for 
operating disconnected for a typical workday. Of course, user activity that is 
[...] simulations in three ways. [...] disconnection. Second, [...] a broader
range of user activity by obtaining traces from [...] our environment. Third,
we will evaluate the effect of hoarding [...] profiles that [...]

5.3 Likelihood of Conflicts 

strvTr^^ *r near,y a year, we have 

different network £rt£oL ^Se ^225 T an ob * ct * 

subject to at least fcee ritici!^ £ ^ 40 S tWs observ ation is 
being cautious, knowing thai Z^JlTV * that our users are 

Second, perhaps con^^lf^e^ 1 ^* - ^P-imental system, 
community grows lar^r -tLwa T problem onl y as the Coda user 
will lead to ST, ™Sco^ PCrhapS 6Xtended ^tary disconnections 

To obtain data on the likelihood of conflicts at larger scale, we instrumented
the AFS servers in our environment. These servers are used by over 400
computer science faculty, staff and graduate students for research, program
development, and education. Their usage profile includes a significant amount
of collaborative activity. Since Coda is descended from AFS and makes the
same kind of usage assumptions, we can use this data to estimate how frequent
conflicts would be if Coda were to replace AFS in our environment.

Every time a user modifies an AFS file or directory, we compare his
identity with that of the user who made the previous mutation. We also note
the time interval between mutations. For a file, only the close after an open
for update is counted as a mutation; individual write operations are not
counted. For directories, all operations that modify a directory are counted
as mutations.
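
The bookkeeping described above can be sketched as follows. This is an
illustrative Python sketch, not the instrumentation that was added to the AFS
servers: the record layout, the function name and the result keys are
assumptions. The sketch also assumes that the first observed mutation of an
object is not counted, since no previous writer is known for it.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

DAY = 24 * 60 * 60   # seconds
WEEK = 7 * DAY

# Hypothetical mutation record: (timestamp_in_seconds, user, object_id).
# For a file, one record stands for a close after an open for update.
Mutation = Tuple[float, str, str]


def write_sharing(mutations: Iterable[Mutation]) -> Dict[str, float]:
    """Return the fraction of mutations in each write-sharing category."""
    last_writer: Dict[str, Tuple[float, str]] = {}
    counts: Counter = Counter()
    for timestamp, user, obj in sorted(mutations):
        previous = last_writer.get(obj)
        last_writer[obj] = (timestamp, user)
        if previous is None:
            continue  # assumption: nothing to compare the first mutation with
        counts["total"] += 1
        prev_time, prev_user = previous
        if user == prev_user:
            counts["same_user"] += 1
            continue
        interval = timestamp - prev_time
        if interval < DAY:
            counts["other_user_within_day"] += 1
        if interval < WEEK:
            counts["other_user_within_week"] += 1
    total = counts["total"]
    keys = ("same_user", "other_user_within_day", "other_user_within_week")
    return {key: counts[key] / total for key in keys} if total else {}


sample = [
    (0, "alice", "file1"),
    (100, "alice", "file1"),          # same user as the previous writer
    (200, "bob", "file1"),            # different user, 100 seconds later
    (200 + 2 * DAY, "alice", "file1"),  # different user, two days later
    (0, "carol", "dir1"),             # only mutation of dir1: not counted
]
stats = write_sharing(sample)
```

Only the most recent writer of each object has to be remembered, so the
state kept by such a counter grows with the number of objects rather than
with the number of mutations.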

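Table II presents our observations over a period of twelve months. The data
is classified by volume type: user volumes containing private user data,
project volumes used for collaborative work, and system volumes containing
program binaries, libraries, header files and other similar data. On average,
a project volume has about [...] files and [...] directories, and a system
volume has about 1600 files and 130 directories. User volumes tend to be
smaller, averaging about [...] files and [...] directories, because users
often place much of their data in their project volumes.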

Table II shows that the great majority of all modifications were by the
previous writer, and that two different users seldom modify the same object
less than a day apart. We had expected to see the highest degree of
write-sharing on project files or directories [...] the absence of
write-sharing is even more striking: more than 99.5% of all mutations are by
the previous writer, and the chances of two different users modifying the
same object within a week are less than 0.4%! This data is highly encouraging
from the point of view of optimistic replication. It suggests that conflicts
would not be a serious problem if AFS were replaced by Coda in our
environment.



6. RELATED WORK 



Coda is unique in that it exploits caching for both performance and high
availability while preserving a high degree of transparency. We are aware of
no other system, published or unpublished, that duplicates this key aspect of
Coda.

By providing tools to link local and remote name spaces, the Cedar file
system [20] provided rudimentary support for disconnected operation. But
since this was not its primary goal, Cedar did not provide support for
hoarding, transparent reintegration or conflict detection. Files were
versioned and immutable, and a Cedar cache manager could substitute a cached
version of a file on reference to an unqualified remote file whose server was
inaccessible. However, the implementors of Cedar observe that this capability
was not often exploited since remote files were normally referenced by
specific version number.

Birrell and Schroeder pointed out the possibility of "stashing" data for
availability in an early discussion of the Echo file system [14]. However, a
more recent description of Echo [8] indicates that it uses stashing only for the
highest levels of the naming hierarchy.

The FACE file system [3] uses stashing but does not integrate it with
caching. The lack of integration has at least three negative consequences.
First, it reduces transparency because applications deal with two different
name spaces, with different consistency properties. Second, utilization of
local disk space is likely to be much worse. Third, recent usage information
from cache management is not available to manage the stash. The available
literature on FACE does not report on how much the lack of integration
detracted from the usability of the system.

An application-specific form of disconnected operation was implemented in
the PCMAIL system at MIT [12]. PCMAIL allowed clients to disconnect,
manipulate existing mail messages and generate new ones, and resynchronize
with a central repository at reconnection. Besides relying heavily on the
semantics of mail, PCMAIL was less transparent than Coda since it required
manual resynchronization as well as preregistration of clients with servers.

The use of optimistic replication in distributed file systems was pioneered
by Locus [23]. Since Locus used a peer-to-peer model rather than a client-
server model, availability was achieved solely through server replication.
There was no notion of caching, and hence of disconnected operation.

Coda has benefited in a general sense from the large body of work on
transparency and performance in distributed file systems. In particular, Coda
owes much to AFS [19], from which it inherits its model of trust and
integrity, as well as its mechanisms and design philosophy for scalability.

7. FUTURE WORK

Disconnected operation in Coda is a facility under active development. In
earlier sections of this paper we described work in progress in the areas of log
optimization, granularity of reintegration, and evaluation of hoarding. Much
additional work is also being done at lower levels of the system. In this
section we consider two ways in which the scope of our work may be
broadened.

An excellent opportunity exists in Coda for adding transactional support to
Unix. Explicit transactions become more desirable as systems scale to
hundreds or thousands of nodes, and the informal concurrency control of Unix
becomes less effective. Many of the mechanisms supporting disconnected
operation, such as operation logging, precedence graph maintenance, and
conflict checking, would transfer directly to a transactional system using
optimistic concurrency control. Although transactional file systems are not a
new idea, no such system with the scalability, availability and performance
properties of Coda has been proposed or built.

A different opportunity exists in extending Coda to support weakly
connected operation, in environments where connectivity is intermittent or of
low bandwidth. Such conditions are found in networks that rely on voice-grade
lines, or that use wireless technologies such as packet radio. The ability to
mask failures, as provided by disconnected operation, is of value even with
weak connectivity. But techniques which exploit and adapt to the
communication opportunities at hand are also needed. Such techniques may
include more aggressive write-back policies, compressed network transmission,
partial file transfer and caching at intermediate levels.

8. CONCLUSION 

Disconnected operation is a tantalizingly simple idea. All one has to do is to 
preload one's cache with critical data, continue normal operation until dis- 
connection, log all changes made while disconnected and replay them upon 
reconnection. 
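
Stated as code, the idea really is short. The Python sketch below is a toy
illustration, not Coda's design: the `Server` and `Client` classes and their
methods are hypothetical, files are reduced to strings in a dictionary, and
the replay step applies logged changes blindly, with no conflict detection
and no log optimization.

```python
from typing import Dict, Iterable, List, Tuple


class Server:
    """Hypothetical shared repository: a dictionary of path -> contents."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = dict(files)


class Client:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.cache: Dict[str, str] = {}
        self.log: List[Tuple[str, str]] = []  # changes made while disconnected
        self.connected = True

    def hoard(self, paths: Iterable[str]) -> None:
        """Preload the cache with critical data while still connected."""
        for path in paths:
            self.cache[path] = self.server.files[path]

    def disconnect(self) -> None:
        self.connected = False

    def read(self, path: str) -> str:
        if path in self.cache:
            return self.cache[path]
        if not self.connected:
            # A miss while disconnected cannot be serviced.
            raise FileNotFoundError(path)
        self.cache[path] = self.server.files[path]
        return self.cache[path]

    def write(self, path: str, data: str) -> None:
        self.cache[path] = data
        if self.connected:
            self.server.files[path] = data   # write through while connected
        else:
            self.log.append((path, data))    # log the change for later replay

    def reconnect(self) -> int:
        """Replay logged changes at the server; return how many were replayed."""
        replayed = len(self.log)
        for path, data in self.log:
            self.server.files[path] = data
        self.log.clear()
        self.connected = True
        return replayed


server = Server({"/coda/paper.tex": "draft 1", "/coda/notes.txt": "todo"})
client = Client(server)
client.hoard(["/coda/paper.tex"])
client.disconnect()
client.write("/coda/paper.tex", "draft 2")   # logged, not yet at the server
replayed = client.reconnect()
```

Everything that this toy leaves out, such as deciding what to hoard, keeping
the log and cache intact across client crashes, and detecting conflicting
updates during replay, is where the real effort lies.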

Implementing disconnected operation is not so simple. It involves major
modifications and careful attention to detail in many aspects of cache
management. While hoarding, a surprisingly large volume and variety of
interrelated state has to be maintained. When emulating, the persistence and
integrity of client data structures become critical. During reintegration,
there are dynamic choices to be made about the granularity of reintegration.

Only in hindsight do we realize the extent to which implementations of
traditional caching schemes have been simplified by the guaranteed presence
of a lifeline to a first-class replica. Purging and refetching on demand, a
strategy often used to handle pathological situations in those
implementations, is not viable when supporting disconnected operation.
However, the obstacles to realizing disconnected operation are not
insurmountable. Rather, the central message of this paper is that
disconnected operation is indeed feasible, efficient and usable.

One way to view our work is to regard it as an extension of the idea of
write-back caching. Whereas write-back caching has hitherto been used for
performance, we have shown that it can be extended to mask temporary
failures too. A broader view is that disconnected operation allows graceful
transitions between states of autonomy and interdependence in a distributed
system. Under favorable conditions, our approach provides all the benefits of
remote data access; under unfavorable conditions, it provides continued
access to critical data. We are certain that disconnected operation will
become increasingly important as distributed systems grow in scale, diversity
and vulnerability.